Conversion between Scripts of Punjabi: Beyond Simple Transliteration
Abstract
This paper describes statistical techniques for modelling transliteration systems between the scripts of the Punjabi language. Punjabi is one of the few languages written in more than one script. In India, Punjabi is written in Gurmukhi script, while in Pakistan it is written in Shahmukhi (Perso-Arabic) script. Shahmukhi script has its origin in the ancient Phoenician script, whereas Gurmukhi script has its origin in the ancient Brahmi script. Punjabi as spoken in the Eastern and Western parts is mutually comprehensible, but the written forms are not. This has created a script wedge: the majority of Punjabi speakers in Pakistan cannot read Gurmukhi script, and the majority of Punjabi speakers in India cannot read Shahmukhi script. In this paper, we present an advanced and highly accurate transliteration system between the Gurmukhi and Shahmukhi scripts of Punjabi. It addresses challenges such as multiple and zero character mappings, missing vowels, word segmentation, variations in pronunciation and orthography, and transliteration of proper nouns. It does so by combining efficient algorithms and special rules with lexical resources such as a Gurmukhi spell checker, corpora of both scripts, Gurmukhi-Shahmukhi transliteration dictionaries, and statistical language models. The proposed system attains more than 98.6% accuracy at word level when transliterating Gurmukhi text to Shahmukhi. The reverse direction, transliterating Shahmukhi text to Gurmukhi, is more complex and challenging, but our system achieves 97% accuracy at word level there too.
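As a rough illustration of what "multiple/zero character mappings" combined with a statistical language model can look like, here is a minimal Python sketch. It is not the system described in the abstract. The mapping table, the romanized placeholder symbols, the word counts, and the function names `generate_candidates` and `transliterate` are all hypothetical and chosen only to keep the example self-contained.

```python
from itertools import product

# Hypothetical mapping table (romanized placeholders, not real script data).
# Each source symbol maps to one or more target strings; an empty string
# models a zero mapping, where the symbol is simply not written.
CANDIDATES = {
    "s": ["s", "S"],  # one sound, two possible target letters
    "a": ["a", ""],   # short vowel that the target script may omit
    "b": ["b"],
    "r": ["r"],
}

# Hypothetical target-side word counts standing in for a unigram language model.
TARGET_COUNTS = {"Sbr": 40, "sbr": 3, "sabr": 1}


def generate_candidates(word):
    """Return every target string reachable through the mapping table."""
    # Symbols missing from the table are passed through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    return ["".join(parts) for parts in product(*options)]


def transliterate(word):
    """Pick the candidate with the highest count; ties keep generation order."""
    candidates = generate_candidates(word)
    return max(candidates, key=lambda c: TARGET_COUNTS.get(c, 0))


print(generate_candidates("sabr"))
print(transliterate("sabr"))
```

A real system would replace the count table with a language model trained on a corpus and would need further handling for segmentation and proper nouns, which this sketch leaves out.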
Similar resources
Punjabi Machine Transliteration
Machine transliteration transcribes a word written in one script into another language with approximate phonetic equivalence. It is useful for machine translation, cross-lingual information retrieval, and multilingual text and speech processing. Punjabi Machine Transliteration (PMT) is a special case of machine transliteration and is a process of converting a word from Shahmukhi (based on Arabic s...
Shahmukhi to Gurmukhi Transliteration System
The existence of two scripts for the Punjabi language has created a script barrier between the Punjabi literature written in India and that written in Pakistan. This research has developed the first system of its kind for Shahmukhi text without diacritical marks. The proposed system for Shahmukhi to Gurmukhi transliteration has been implemented with various research techniques based on language corp...
Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach
This research paper describes a corpus-based transliteration system for the Punjabi language. The existence of two scripts for the Punjabi language has created a script barrier between the Punjabi literature written in India and that written in Pakistan. This research project has developed the first system of its kind for the Shahmukhi script of the Punjabi language. The proposed system for Shahmukhi to Gurm...
Statistical Approach to Transliteration from English to Punjabi
Machine transliteration plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms. Transliteration is a crucial factor in CLIR and MT, especially when the languages do not use the same scripts. This paper addresses the issue of statistical mach...
Sangam: A Perso-Arabic to Indic Script Machine Transliteration Model
The Indian subcontinent is one of the few parts of the world where a single language is written in different scripts. This is the case, for example, with Punjabi, which is written in Gurmukhi script (a left-to-right script based on Devanagari) in Indian East Punjab and in Shahmukhi (a right-to-left script based on Perso-Arabic) in Pakistani West Punjab. This is also the case with other lang...